VIM for cleaning HTML tags
You have a big file and want to extract one specific value from each relevant line. Vim's substitute command with a regex can do that.
It is not limited to Vim: the same idea should work with grep too. I am using Vim just for this post.
Yes, you could use a crawler for that. In my case, I just wanted something fast for a single file.
TL;DR
:%g!/useful-class/d
:%s/<a.*href="\([^"]*\).*\n/\1\r/g
Testing file
For testing, you can open the file below in vim:
SOME BIG HTML
<a class="useful-class" href="http://example/1">Example 1</a>
NOT USEFUL LINE
<a class="useful-class" href="http://example/2">Example 2</a>
END OF BIG HTML
In this example, we want to remove the lines that are not useful and keep only the URLs.
Running commands
Delete the lines that are not useful:
:%g!/useful-class/d
- %g! Run an action on every line that does not contain a pattern
- useful-class The pattern. You only want to keep the lines with that class
- d The action: delete the line
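If it helps to see the same filter outside Vim, here is a minimal Python sketch of what the command does. It is only an illustration, not part of the Vim workflow, and the variable names (buffer_text, kept_lines) are made up for this example.

```python
import re

# The sample buffer from the "Testing file" section.
buffer_text = """SOME BIG HTML
<a class="useful-class" href="http://example/1">Example 1</a>
NOT USEFUL LINE
<a class="useful-class" href="http://example/2">Example 2</a>
END OF BIG HTML
"""

# :%g!/useful-class/d deletes every line that does NOT match the pattern,
# which is the same as keeping only the lines that do match.
kept_lines = [
    line for line in buffer_text.splitlines()
    if re.search(r"useful-class", line)
]

print("\n".join(kept_lines))
```

Only the two anchor lines survive, which is the state the buffer is in before the substitute command runs.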
Remove unused content:
:%s/<a.*href="\([^"]*\).*\n/\1\r/g
- %s Substitute on every line that matches a pattern
- Pattern (this is long):
  - <a Starts with <a
  - .* followed by any characters until…
  - href=" …it finds href="
  - \([^"]*\) The first group (the "atom"): any run of characters that are not a double quote ([^"]*). The group is whatever sits inside the escaped parentheses \(...\)
  - .* then any characters until…
  - \n …the end of the line (\n)
- \1\r Replace the match with the first group followed by a new line (in the replacement part, Vim uses \r for a line break)
- g Replace every match on a line, not only the first one
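The capture group is the part that is easiest to get wrong, so here is a rough Python equivalent of the substitution. This is a sketch under assumptions: Python's re module writes groups as (...) instead of Vim's \(...\) and uses \n in the replacement where Vim uses \r, and the names filtered and urls are invented for the example.

```python
import re

# The buffer as it looks after the useless lines were deleted.
filtered = (
    '<a class="useful-class" href="http://example/1">Example 1</a>\n'
    '<a class="useful-class" href="http://example/2">Example 2</a>\n'
)

# Same shape as the Vim pattern: <a, anything, href=", the captured URL
# (everything up to the next double quote), the rest of the line, newline.
pattern = re.compile(r'<a.*href="([^"]*).*\n')

# \1 is the first captured group, like \1 in the Vim replacement.
urls = pattern.sub(r"\1\n", filtered)

print(urls, end="")
```

Note that .* is greedy in both Vim and Python, so on a line with several href attributes this pattern would capture the last one, not the first.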
Done
The result is this:
http://example/1
http://example/2
Not simple, but the second time is easier to remember.
I used this today on a single long page: I copied the DOM (<ul><li><a>...</a></li></ul>) into a file, ran the commands, and learned the regex along the way.